In this R Markdown document, we analyze various attributes of red wine and how they influence its quality.
We are provided with a red wine data set of 1,599 observations; each record has the following attributes:
The following assumptions were made during the analysis:
As we progress through the document we will:
Let's load the data from the CSV file into an R data frame and examine its structure:
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
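The loading step could look roughly like the sketch below. The path of the real CSV file is not given in this document, so the example writes a few rows (values copied from the `str()` output above, with only a subset of the columns) to a temporary file; `csv_path` and `wine` are hypothetical names.

```r
# Sketch only: the real CSV path is not shown in this document, so a few
# rows copied from the str() output above are written to a temporary file.
csv_path <- tempfile(fileext = ".csv")  # hypothetical stand-in for the data file
writeLines(c(
  '"","fixed.acidity","volatile.acidity","alcohol","quality"',
  '"1",7.4,0.7,9.4,5',
  '"2",7.8,0.88,9.8,5',
  '"3",7.8,0.76,9.8,5',
  '"4",11.2,0.28,9.8,6'
), csv_path)

# read.csv() names the unnamed first column "X", matching the output above
wine <- read.csv(csv_path)

names(wine)  # column names
str(wine)    # column types and a preview of the values
```

The `X` column is just the row index carried over from the file, which is why it appears alongside the measured attributes.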
Although quality is loaded as type int, it takes only a small, finite set of values within the range 0-10. We will therefore convert quality to a factor (categorical variable).
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
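A conversion like this can be sketched as follows. The scores are hypothetical stand-ins for the `quality` column, and the ordered variant at the end is an optional alternative that this document does not claim to use.

```r
# Hypothetical quality scores standing in for the quality column
quality <- c(5L, 5L, 5L, 6L, 5L, 7L, 7L, 3L, 4L, 8L)

# factor() keeps one level per distinct score, sorted in increasing order
quality_f <- factor(quality)
levels(quality_f)

# Optional alternative (an assumption, not taken from this document):
# an ordered factor additionally records that 3 < 4 < ... < 8
quality_o <- factor(quality, ordered = TRUE)
quality_o[1] < quality_o[4]
```

Only the scores that actually occur become levels, which is why the `str()` output above reports 6 levels rather than 11.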
Now that we have fixed the data types of the columns, let’s check whether any of the columns contain missing values.
## [1] FALSE
The output above confirms that there are no missing values, so we can proceed with analyzing the data.
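The single `FALSE` shown above is consistent with a check such as `any(is.na(...))`, although the exact call is not shown in this document. A minimal sketch on a small hypothetical data frame, with a per-column count added for comparison:

```r
# Small hypothetical data frame; the second copy has one value set to NA
wine_sample <- data.frame(
  pH      = c(3.51, 3.20, 3.26),
  alcohol = c(9.4, 9.8, 9.8)
)
wine_with_gap <- wine_sample
wine_with_gap$alcohol[2] <- NA

any(is.na(wine_sample))        # is there any missing value at all?
colSums(is.na(wine_with_gap))  # missing values counted per column
```

The per-column count is useful when the first check returns `TRUE` and the affected attribute needs to be located.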
We will use the ggplot2 library for univariate plotting.
Let’s start by analyzing the range and distribution of all quantitative attributes in the data set.
As the summary below shows, fixed acidity in the data set ranges from 4.6 to 15.9.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Now let’s analyze the frequency distribution of fixed acidity in the data set. The bar chart below shows that most of the wines have fixed acidity between 7 and 9, which is consistent with the interquartile range (IQR) in the summary.
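The summary and the distribution plot for one attribute could be produced along these lines. This is a sketch on the ten fixed acidity values visible in the `str()` output, not the full data set; the use of `geom_histogram()` and the bin width are assumptions, since the plotting code is not shown here.

```r
library(ggplot2)

# The ten fixed acidity values shown in the str() output above
wine_sample <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
)

summary(wine_sample$fixed.acidity)  # min, quartiles, mean, max
IQR(wine_sample$fixed.acidity)      # 3rd quartile minus 1st quartile

# Distribution of the values; binwidth = 0.5 is an arbitrary choice
p <- ggplot(wine_sample, aes(x = fixed.acidity)) +
  geom_histogram(binwidth = 0.5) +
  labs(x = "Fixed acidity", y = "Number of wines")
print(p)
```

The same pattern applies to every other quantitative attribute by swapping the column name.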
As the summary below shows, volatile acidity in the data set ranges from 0.12 to 1.58.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The bar chart below depicts the frequency distribution of volatile acidity in the data set. A quick examination reveals outliers in the distribution, which explains the relatively large gap between the 3rd quartile and the maximum value.
As the summary below shows, citric acid measurements in the data set range from 0 to 1.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The bar chart below depicts the frequency distribution of citric acid measurements in the data set. A quick examination reveals outliers in the distribution, which explains the relatively large gap between the 3rd quartile and the maximum value.
As the summary below shows, residual sugar measurements (grams per litre) in the data set range from 0.9 to 15.5.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The bar chart below depicts the frequency distribution of residual sugar measurements in the data set. A quick examination reveals outliers in the distribution, which explains the relatively large gap between the 3rd quartile and the maximum value.
As the summary below shows, chloride measurements in the data set range from 0.012 to 0.611.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The bar chart below depicts the frequency distribution of chloride measurements in the data set. A quick examination reveals outliers in the distribution, which explains the relatively large gap between the 3rd quartile and the maximum value.
As the summary below shows, free sulfur dioxide measurements (parts per million) in the data set range from 1 to 72.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The bar chart below depicts the frequency distribution of free sulfur dioxide measurements in the data set. A quick examination reveals outliers in the distribution, which explains the relatively large gap between the 3rd quartile and the maximum value.
As the summary below shows, total sulfur dioxide measurements in the data set range from 6 to 289.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The bar chart below depicts the frequency distribution of total sulfur dioxide measurements in the data set. A quick examination reveals outliers in the distribution, which explains the relatively large gap between the 3rd quartile and the maximum value.
As the summary below shows, density measurements in the data set range from 0.9901 to 1.0037.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The bar chart below shows that most of the wines in the data set have a density between 0.99 and 1, which is consistent with the interquartile range (IQR) in the summary.
As the summary below shows, pH measurements in the data set range from 2.74 to 4.01.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The bar chart below depicts the frequency distribution of pH measurements in the data set. A quick examination reveals outliers in the distribution, which explains the relatively large gap between the 3rd quartile and the maximum value.
As the summary below shows, sulphates measurements in the data set range from 0.33 to 2.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The bar chart below depicts the frequency distribution of sulphates measurements in the data set. A quick examination reveals outliers in the distribution, which explains the relatively large gap between the 3rd quartile and the maximum value.
As the summary below shows, alcohol measurements in the data set range from 8.4 to 14.9.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The bar chart below depicts the frequency distribution of alcohol measurements in the data set. A quick examination reveals outliers in the distribution, which explains the relatively large gap between the 3rd quartile and the maximum value.
The bar chart below depicts the number of wines at each quality level in the data set.
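Counting wines per quality level can be sketched as below, using the ten quality values visible in the `str()` output. Base R `barplot()` is used here only for brevity; the document's own plots are made with ggplot2.

```r
# The ten quality values shown in the str() output above, as a factor
quality <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))

counts <- table(quality)  # number of wines per quality level
counts

# Base R bar chart of the counts (swapped in for ggplot2 for brevity)
barplot(counts, xlab = "Quality", ylab = "Number of wines")
```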
So far we have analyzed the range and frequency distribution of each attribute in isolation, strictly from a statistical point of view. Going forward, let's analyze the key attributes of wine and how they relate to one another.
Note: All attributes are cut off at the 99th percentile to exclude outliers.
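One way to apply such a cut-off is sketched below. How the cut-off is implemented in this analysis is not shown, so dropping the values above `quantile(x, 0.99)` is an assumption, and the data are made up.

```r
# Hypothetical values: 99 typical measurements and one extreme value
set.seed(1)
residual.sugar <- c(runif(99, min = 1, max = 4), 15.5)

cutoff  <- quantile(residual.sugar, probs = 0.99)  # 99th percentile
trimmed <- residual.sugar[residual.sugar <= cutoff]

length(residual.sugar)  # number of values before the cut-off
length(trimmed)         # number of values kept
max(trimmed)            # the extreme value is no longer present
```

In a ggplot2 plot the same effect is often obtained by limiting the axis to the cut-off value, which also removes the points beyond it from the drawing.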
"Acidity is a fundamental property of wine, imparting sourness and resistance to microbial infection." (Doug Nierman, 2004)
Acids are major wine constituents and contribute greatly to its taste. Traditionally, total acidity is divided into two groups: the volatile acids and the nonvolatile, or fixed, acids. Fixed acids originate from grapes; the higher the fixed acidity, the more sour the wine will be.
Fixed acidity is inversely related to pH (the pH scale runs from 0, very acidic, to 14, very basic): the higher the fixed acidity, the lower the pH. The scatter plot below shows this relationship.
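The direction of this relationship can be checked numerically with a correlation coefficient. The sketch below uses only the ten value pairs visible in the `str()` output, so the number it prints says nothing about the full data set; base R `plot()` is used for brevity in place of ggplot2.

```r
# The ten fixed acidity and pH values shown in the str() output above
fixed.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
pH            <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

# A negative coefficient means pH tends to fall as fixed acidity rises
r <- cor(fixed.acidity, pH)
r

# Scatter plot of the same pairs
plot(fixed.acidity, pH, xlab = "Fixed acidity", ylab = "pH")
```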
Found in small quantities, citric acid can add ‘freshness’ and flavor to wines. Adding citric acid increases the overall fixed acidity.
Volatile acidity is primarily acetic acid generated during the fermentation process. Acetic acid results in the concomitant formation of other, sometimes unpleasant, aroma compounds. One preventive method for reducing the amount of acetic acid is to inject sulphates during the fermentation process.
The scatter plots below depict how volatile acidity varies with the amount of sulphur content (in any form) in the wine. Even for the same sulphur content, volatile acidity varies over a wide range, which suggests that other factors (which may or may not be captured in the data set) are influencing volatile acidity.
Now that we have analyzed how related attributes influence each other, we will continue by plotting and analyzing how each attribute influences the quality of the wine. We will use box plots for this analysis.
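A box plot of one attribute per quality level could be built as in the following sketch. It uses the ten alcohol and quality values visible in the `str()` output; the axis labels and the choice of alcohol as the example attribute are assumptions.

```r
library(ggplot2)

# The ten alcohol and quality values shown in the str() output above
wine_sample <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))
)

# Median alcohol per quality level, i.e. the centre line of each box
med <- tapply(wine_sample$alcohol, wine_sample$quality, median)
med

# One box per quality level
p <- ggplot(wine_sample, aes(x = quality, y = alcohol)) +
  geom_boxplot() +
  labs(x = "Quality", y = "Alcohol")
print(p)
```

Because quality is a factor, ggplot2 treats it as a discrete axis and draws a separate box for each level.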
The box plot below depicts the distribution of fixed acidity for each quality level. From the plot:
The box plot below depicts the distribution of volatile acidity for each quality level. From the plot:
The box plot below depicts the distribution of citric acid for each quality level. From the plot:
The box plot below depicts the distribution of residual sugar for each quality level. From the plot:
The box plot below depicts the distribution of sulphates for each quality level. From the plot:
The box plot below depicts the distribution of pH for each quality level. From the plot:
The box plot below depicts the distribution of alcohol percentage for each quality level. From the plot:
So far we have analyzed the attributes of wine in isolation and, in detail, how attributes within the same class affect each other.
Just like any other business, the goal of red wine production is to produce high-quality wines. Now let’s analyze how grouped attributes influence the quality of the wine.
We will keep alcohol content as the primary parameter, i.e., for a given alcohol level we will examine how the other attributes influence the quality of the wine.
Along with plotting the values from the data set, we will model the data using the stat_smooth function. We will let the function select the method and formula automatically, with a 95% confidence level. This is a good starting point for identifying trends visually.
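A plot of this kind might be set up as in the sketch below. The data are randomly generated, and the mapping of attributes to axes and colour is an assumption, since the plotting code is not shown in this document.

```r
library(ggplot2)

# Hypothetical data: 60 wines with made-up alcohol and residual sugar values
set.seed(42)
wine_sample <- data.frame(
  alcohol        = runif(60, min = 8.4, max = 14.9),
  residual.sugar = runif(60, min = 0.9, max = 6),
  quality        = factor(sample(3:8, 60, replace = TRUE))
)

# No method or formula is passed, so stat_smooth() chooses them itself;
# level = 0.95 (the default) draws a 95% confidence band around the fit
p <- ggplot(wine_sample, aes(x = alcohol, y = residual.sugar)) +
  geom_point(aes(colour = quality)) +
  stat_smooth(level = 0.95)
print(p)
```

When the plot is printed, ggplot2 reports which smoothing method and formula it selected, which is worth noting before reading anything into the fitted curve.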
The two plots below show how quality is influenced by alcohol and residual sugar, along with a model that predicts the potential quality range for given alcohol and residual sugar values. Higher sugar levels add sweetness to the wine.
From the above plots we can deduce:
The two plots below show how quality is influenced by alcohol and citric acid, along with a model that predicts the potential quality range for given alcohol and citric acid values. Citric acid adds freshness and flavor to the wine. Not all wines contain citric acid, so for better modeling we will ignore the wines with no citric acid content.
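Excluding the wines without citric acid amounts to a simple row filter. A minimal sketch on the ten citric acid and alcohol values visible in the `str()` output (`with_citric` is a hypothetical name):

```r
# The ten citric acid and alcohol values shown in the str() output above
wine_sample <- data.frame(
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol     = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Keep only the wines that contain some citric acid
with_citric <- subset(wine_sample, citric.acid > 0)
nrow(with_citric)
```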
From the above plots we can deduce:
The two plots below show how quality is influenced by alcohol and volatile acidity, along with a model that predicts the potential quality range for given alcohol and volatile acidity values. Higher volatile acidity levels lead to an unpleasant aroma.
From the above plots we can deduce:
The red wine data set is relatively small, with only 1,599 observations and 11 attributes. It is very clean and did not require any data wrangling.
As the data set was small and there was no complete documentation, the domain knowledge needed to analyze the data was acquired from http://waterhouse.ucdavis.edu/whats-in-wine/red-wine-composition.
Even though that document describes how strongly volatile acidity and citric acid influence the quality of wine, the data set did not support the hypothesis. For both volatile acidity and citric acid even
Even though the document mentions that higher volatile acidity causes a bad aroma, wines with high alcohol content are still rated as high quality owing to that alcohol content. Wines with lower citric acid content are also rated as higher quality when their alcohol content is higher.
The following assumptions were made during the analysis: